THE CAMBRIDGE LANGUAGE SURVEY 
1. Introduction

2. Overview

3. Objectives

4. Areas of Research in Progress and Contemplated

4.1 Concordancing
4.2 Core Vocabulary
4.3 Machine tractable dictionary databases
4.4 Semantic Relations
4.5 Corpora categorised to demonstrate frequency of meaning
4.6 Cultural data
4.7 Language variety
4.8 Words in Groups
4.9 Collocation
4.10 Towards a universal linguistic coding
4.11 Marking of morpheme boundaries
4.12 Classified interlanguage interference data
4.13 Neologisms and proper names

5. Products and Benefits of the Survey
6. Potential Sponsors and Other Contacts
7. CLS Panel


Paul Procter

Senior Editor, International Dictionaries
Cambridge University Press 

Edinburgh Building 

Shaftesbury Road 

Cambridge CB2 2RU 

Daytime telephone: +44 (0)223 325880 

Evening telephone: +44 (0)920 465890 

Facsimile: +44 (0)223 315052 

Electronic mail: psp10@uk.ac.cam.phx 


(outside UK) psp10@phoenix.cambridge.ac.uk 
Telex: 817256


Chancellor: H.R.H. The Prince Philip, Duke of Edinburgh, Hon. LL D 


The Edinburgh Building, Shaftesbury Road, Cambridge CB2 2RU. Telephone 0223 312393. Facsimile 0223 315052


1. INTRODUCTION 


MAIN CONTACTS IN THE U.K. 
Sir John Lyons — Chairman 


Paul Procter, Cambridge University Press (formerly editor of Longman Dictionary of
Contemporary English - LDOCE) - Project Coordinator


Ted Briscoe, Natural Language Processing Group, Cambridge University Computer 
Laboratory 


Sidney Greenbaum, International Corpus of English (including the Survey of English 
Usage — SEU), University College London 


Reinhard Hartmann, Dictionary Research Centre, University of Exeter 
Stephen Pulman, SRI, Cambridge 


ACQUILEX 


University of Amsterdam, Department of English 
Universitat Politecnica de Catalunya, Facultat de Informatica
Istituto di Linguistica Computazionale, CNR, Pisa 
Universita di Pisa, Dipartimento di Linguistica 

Biblograf (Vox), Barcelona 

Van Dale Lexicografie, Utrecht 


The foundations for a successful consortium in the UK are being laid through liaison with
several parties, notably with the Natural Language Processing Group of Cambridge
Computer Laboratory at the University of Cambridge, and with the International Corpus
of English, centred at University College London. This is leading naturally to the
development of the international consortium in association with the overseas contacts of
these two groups. In the case of the former, the early involvement in the final stage of the
ESPRIT(EC)-funded ACQUILEX project is the catalyst, and is leading on to a next stage
in conjunction with ACQUILEX partners in Holland, Spain and Italy. In the case of the
latter, close links are being created with the team at the SEU, with cooperation
chiefly in the area of corpus concordancing and retrieval.


WHAT ARE THE CHIEF DISTINGUISHING FEATURES OF THE SURVEY? 


1) CLS is multilingual, where possible treating all the languages involved (English,
French, German, Spanish, Italian, Dutch, Japanese) with equal rigour. This is the
aim from the start, although inevitably at start-up some languages will be
treated first while partnerships with others in different countries are being
established. For example, ACQUILEX involves only 4 of these languages.


2) CLS is international, relating to geographic areas where the languages are spoken
and concerned with language variety, and with cultural differences within
languages and language groups.


3) CLS is eclectic and is in a process of evolution in activities, goals and purposes.
This means that CLS aims to use whatever techniques are established or become 
available through technological developments. 


4) CLS provides for the exchange of skills, tools and materials between partners in 
the consortium on an ongoing basis, whether academic or industrial (including 
publishing) partners. 


5) CLS is not in competition with any existing research projects, but aims to
cooperate and exchange information and materials wherever possible, on an
equitable basis.


2. OVERVIEW 


Cambridge University Press (CUP), a department of the University of Cambridge, U.K., 
is developing a range of language reference publications with a view to book and 
electronic publication, and has begun the process of collecting and analysing language 
data. Various corpora and sources are being investigated, and we are building up corpus
materials. The programme includes both monolingual and bilingual publications. A team 
of lexicographers, language researchers and programmers has been established. 


We are now in the process of developing language resources, in conjunction with potential 
partners. Governments, universities and industrial organisations in a number of countries 
and technology areas are engaged in language research. For some of these (for example,
in the fields of communication and information technology), natural language processing 
(NLP) is a core area of activity; for others in other product areas, language research is of 
interest mainly in enabling them to carry out their cross-border business more efficiently 
(as in assisting the processes of producing documentation in many languages by 
machine-assisted translation aids). 


By pooling financial and human resources and sharing the fruits of research, CLS aims to 
meet the aspirations of all partners while reducing duplication of effort. The concept is of 
an ‘Information Resource’ which can be shared with people throughout the world without 
diminution of its value, and with enhancement through fruitful cooperation. 


Although the methods employed by the survey are eclectic, using computer-based systems 
alongside manual collection, and various hybrids of computer-assisted manual operation, 
the results are to be incorporated into a unified structure for access and dissemination in 
machine-readable form. 


In each country, we are establishing relationships with language research organisations,
publishers, companies and government agencies. 


What is the Cambridge Language Survey? 


There have been many surveys of English — the great dialect surveys, the linguistic 
studies of contemporary English (notably the SEU), and of course lexicographic citation 
banks, and (in recent years) computerized corpora for linguistic studies (Brown etc.) and 
for lexicography (Cobuild etc.). CLS does not attempt to duplicate these projects, and 
indeed will wherever possible share findings and materials. CLS differs mainly in being 
multilingual and international, and in using a wide range of techniques and systems. 


The choice of title, Cambridge Language Survey, indicates a breadth of activity, in
particular that English, although very important, is not placed centrally as the only 
language to be collected and studied. 


Survey denotes much more than Corpus, where the emphasis is on data. Survey covers 
data, collection systems, and software systems for analysis. 


International denotes multicultural and multilingual, and the Survey's focus, while 
concerned with English in its various national varieties, places studies of other languages 
equally high up in its priorities, and has as a central principle the concept that languages 
can be most usefully discussed comparatively. The Survey will therefore not just be 
looking at insights into English obtained through comparison with other languages, but 
also at comparable insights between all the languages studied providing benefits to all the 
partners. 




3. OBJECTIVES


The objectives vary between the different kinds of partner, each of whom will have
differing interests and requirements, depending on types of market and product:-


PUBLISHING PARTNERS 


These are interested mainly in bilingual and multilingual publishing in book and electronic 
form, and in tools to assist the publishing process, such as software tools and production 
systems. 


ACADEMIC PARTNERS 


The focus here is on data and systems for scholarly research, including multilingual 
language corpora, and tools for their analysis. 


INDUSTRIAL PARTNERS 


The survey will help to build products and services, and to exploit technological advances 
as they occur in creating new product opportunities in various countries. Tools of the 
survey should help to lead to internal organisational economies, as in providing translation 
aids for technical manuals. Computer companies will derive benefit from materials which 
can be incorporated into commercial software. 


4. AREAS OF RESEARCH IN PROGRESS AND CONTEMPLATED 


We believe that there is a consensus that a number of tasks need addressing to enable 
language research to progress in ways which will be of benefit to scholars, people 
concerned with information dissemination and processing (including publishers of books or 
electronic media), and industrial groups needing such research for product development. 


Among these so far identified are:- 


4.1 Concordancing 


For those not familiar with the concept of concordancing, it is the process of sorting all 
the words in a text so that all instances of one word form fall together, for example:- 


... printed some story about strikes, actually, the magazine did and he thought ...
... out of action by first-shot strikes, and a company of Soviet infantry ...
... sooner or later the hour strikes, and if they must rest where better ...
...ere the mouth. When the snake strikes, its mouth opens wide, the bone to which ...
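
The layout above can be produced by a very small program. The following is only a minimal sketch of keyword-in-context concordancing, assuming plain running text held in memory; the function name `kwic` and the `width` parameter are illustrative choices and do not describe the CLS software.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per occurrence of `keyword`.

    Each line shows up to `width` characters of left context, padded so
    that every occurrence of the keyword starts in the same column.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append(f"{left:>{width}}{match.group(0)}{right}")
    return lines

sample = ("Sooner or later the hour strikes, and if they must rest. "
          "When the snake strikes, its mouth opens wide.")
for line in kwic(sample, "strikes"):
    print(line)
```

A production tool would index the corpus rather than scan it on every query, which is where the data-volume limitations mentioned below arise.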


Most existing concordancing software appears to have limitations either in functionality or 
in the ability to manage large volumes of data efficiently. Concordancing is an essential 
tool for any kind of language research, and the establishment of effective software is a 
prerequisite for other activities. Support for the development of concordancing software is 
therefore an important part of the Survey. This software must allow, for example:- 


The handling of large enough pieces of text 
The ability to trace various types of words in groups (collocations) 
Speedy presentation on an interactive basis of extracted data 


We are developing and adapting existing tools by working closely with Sidney
Greenbaum’s software team at the SEU and with Ted Briscoe’s team at the Cambridge
Computer Laboratory.


A main application of concordancing in the CLS will be the creation of a preliminary
concordance by machine, followed by assigning linguistic codes (e.g. semantic and
syntactic codes) to the word forms (an editorial task) to give genuine semantic frequency
counts. Good concordance software has numerous potential applications; for example, it
forms the basis of retrieval tools for texts supplied on CD-ROM.


The manual coding of a corpus with syntactic and semantic tags on the lexical items is:- 


1) necessary because it is not possible to achieve this with the appropriate level of 
precision and coding detail by automatic means at present 


2) a time-consuming task requiring the most efficient machine-supported input tools. 


Where there is an existing corpus of data, such as the SEU, the coded data will be passed 
back for use in research. 


The collection of large quantities of textual data has hardware and software implications 
(e.g. will need the use of storage devices such as CD-ROM / WORM drives, and will 
benefit from data and text compression software techniques to make maximum use of 
conventional hard-disk technology). 


Concordancing software will need to be capable of restoring inflected forms to the base
form (lemma), e.g. assigning appropriate instances of runs, running, ran to the verb run. This
‘lemmatization’ will probably be able to build on existing algorithms for the various
languages rather than developing new ones.
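
As an illustration of the run example, the sketch below combines a small exception table with two suffix rules. It is a toy under stated assumptions: the table `IRREGULAR` and the rules are invented for this example, it ignores part of speech (so it cannot tell the verb runs from the noun runs), and it will mishandle many words (for instance, stems that genuinely end in a double consonant).

```python
# Hypothetical exception table for irregular forms.
IRREGULAR = {"ran": "run"}

def lemmatize(form):
    """Map an inflected English verb form to a base form (toy rules only)."""
    word = form.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word.endswith("ing") and len(word) > 5:
        stem = word[:-3]
        # Undo consonant doubling: "runn" -> "run".
        if len(stem) >= 2 and stem[-1] == stem[-2]:
            stem = stem[:-1]
        return stem
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

for form in ("runs", "running", "ran", "run"):
    print(form, "->", lemmatize(form))
```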


The requirements of ‘words in groups’ necessitate the efficient finding of groups of words
that may be separated from each other by intervening words, e.g. in an English phrasal
verb, as in the sentence ‘He put John up to making the application’.
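
A minimal sketch of such a search is given here, assuming the text has already been split into word tokens. The function `find_group` and its `max_gap` parameter are illustrative names; a real tool would also need lemmatization so that put, puts and putting are matched together.

```python
def find_group(tokens, parts, max_gap=3):
    """Find the words in `parts`, in order, allowing up to `max_gap`
    intervening tokens between consecutive parts.

    Returns a list of tuples holding the token positions of each match.
    """
    lowered = [t.lower() for t in tokens]
    hits = []
    for i, token in enumerate(lowered):
        if token != parts[0]:
            continue
        positions = [i]
        for part in parts[1:]:
            start = positions[-1] + 1
            window = lowered[start:start + max_gap + 1]
            if part not in window:
                break
            positions.append(start + window.index(part))
        else:
            # Reached only when every part was found within its window.
            hits.append(tuple(positions))
    return hits

sentence = "He put John up to making the application".split()
print(find_group(sentence, ["put", "up", "to"]))
```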


The frequency of all types of collocation with particular meanings of particular word forms
will be analysed statistically so as to provide tools for decoding natural language in
addition to supporting electronic dictionaries and software tools such as grammatical
taggers.


Dynamic Corpora 


There has been much debate recently on the subject of selection of materials for the
building of corpora between those favouring a selection of types of language at the start,
maintaining predetermined percentages of types, e.g. formal spoken, telephone intimate,
etc., and those who are prepared to absorb large quantities of data more or less
opportunistically.
CLS follows the position that some scholars are now favouring, namely that there is a real
benefit in gathering widely as materials become available, provided that software tools are
developed to allow the individual researcher to weight findings in an on-line environment,
depending on the kinds of research being undertaken.
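
One way such weighting could work is sketched below: the relative frequency of an item is computed separately for each text type and then combined using shares chosen by the researcher, so that an opportunistically large category does not dominate. The function name, the category names and all figures are invented for illustration.

```python
def weighted_frequency(counts, sizes, weights):
    """Combine per-category relative frequencies using researcher-chosen weights.

    counts[c]  -- occurrences of the item in category c
    sizes[c]   -- total running words collected for category c
    weights[c] -- share the researcher wants category c to have
    """
    total_weight = sum(weights.values())
    return sum(
        (weights[c] / total_weight) * (counts.get(c, 0) / sizes[c])
        for c in weights
    )

counts = {"newspaper": 80, "formal spoken": 2}            # invented figures
sizes = {"newspaper": 1_000_000, "formal spoken": 100_000}

# Equal weighting of the two text types, then a 3:1 weighting.
print(weighted_frequency(counts, sizes, {"newspaper": 1, "formal spoken": 1}))
print(weighted_frequency(counts, sizes, {"newspaper": 3, "formal spoken": 1}))
```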


4.2 Core Vocabulary 


Here, as in many fields of investigation, there is a degree of consensus between those 
interested in pedagogic needs, and those in computational linguistics. There is a valuable 
tradition in publishing for the foreign learner of English, going back to Michael West’s 
General Service List of English Words, of providing reference and teaching texts in 
controlled vocabularies. West based his work on semantic frequency counts undertaken by 
Irving Lorge of Columbia University. There seems not to have been a significant semantic 
frequency count since Lorge’s work (pre-1949 on 2.5 million words of running text), and 
the list is significantly out-of-date. It is surprising during the age of the computer that the 
task of producing a semantic frequency count for today has not been undertaken, 
especially considering the enormous value in publishing terms of West’s heritage. Many of 
the recent word-form-count studies such as Brown do not in their primary presentation 
distinguish between meanings, so that polysemous word forms are given the same 
weighting as more frequent monosemous word forms. 
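
The difference between a word-form count and a semantic frequency count can be shown with a few hand-tagged tokens. This is a minimal sketch: the sense labels are invented for illustration and are not CLS codes.

```python
from collections import Counter

# Hand-tagged tokens as (word form, sense label); the labels are hypothetical.
tagged = [
    ("bank", "financial"), ("bank", "financial"), ("bank", "river_edge"),
    ("lend", "give_loan"), ("lend", "give_loan"), ("lend", "give_loan"),
]

# A word-form count treats "bank" and "lend" as equally frequent.
form_counts = Counter(form for form, _ in tagged)

# A semantic frequency count separates the meanings of "bank".
sense_counts = Counter(tagged)

print(form_counts)
print(sense_counts.most_common())
```

In the form count the polysemous bank ties with lend, whereas in the sense count the single meaning of lend outranks either meaning of bank.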


Computational linguists are interested in the related subject of ‘sublanguages’. To quote 
Chapter 3 (by Lynette Hirschman of MIT) of the forthcoming book from Cheng-ming 
Guo entitled Machine Tractable Dictionaries: Design and Construction:- 


Researchers have frequently described sublanguages, rather than the full language, 
in order to ‘close’ or limit language. We define sublanguage here as the specialized 
use of language for communication ... Computational linguists have focused on 
sublanguage precisely to escape the problems of handling the full, open language. 


The systematic comparative study of the most frequent meanings in a range of languages 
would be one of the intentions of CLS, and it is expected that this would yield very useful 
insights, and assist with the processes of automatic parsing, and eventually machine 
translation (MT). The languages to be covered in the first phase are English, French, 
German, Dutch, Italian, Spanish and Japanese. 


The steps taken will be:- 


1. Within each language, to establish a core set of high-frequency meanings through
computer concordancing and hand-marking with semantic codes.

2. To cross-map equivalent meanings among the languages of the group.

3. To document, in some manner, degree of equivalence where one-to-one mappings
are inappropriate.


This will result in appropriate semantic links on computer between items in different 
languages. Part of the results of these studies will be an adequate documentation of lexical
interference (false friends), where work has already begun at Cambridge University Press 
with some useful initial results. 
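
The cross-mapping and the documentation of false friends described above might be represented along the following lines. This is a sketch under stated assumptions: the record types `Meaning` and `Link`, the numeric equivalence scale and the threshold are all hypothetical, and the English/French pair sensible is used only as a familiar example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Meaning:
    language: str
    word: str
    gloss: str

@dataclass(frozen=True)
class Link:
    source: Meaning
    target: Meaning
    equivalence: float  # assumed scale: 1.0 = full equivalence, lower = partial

def false_friends(meanings, links, threshold=0.5):
    """Return pairs with the same spelling in different languages that are
    not linked as (near-)equivalent meanings."""
    linked = {
        frozenset((link.source, link.target))
        for link in links
        if link.equivalence >= threshold
    }
    pairs = []
    for i, a in enumerate(meanings):
        for b in meanings[i + 1:]:
            same_form = a.word == b.word and a.language != b.language
            if same_form and frozenset((a, b)) not in linked:
                pairs.append((a, b))
    return pairs

en_sensible = Meaning("en", "sensible", "showing good judgement")
fr_sensible = Meaning("fr", "sensible", "easily affected, sensitive")
fr_raisonnable = Meaning("fr", "raisonnable", "showing good judgement")

links = [Link(en_sensible, fr_raisonnable, 0.9)]
for a, b in false_friends([en_sensible, fr_sensible, fr_raisonnable], links):
    print(a.language, a.word, "/", b.language, b.word)
```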


4.3 Machine Tractable Dictionary Databases in each language for parsing of natural
language


A machine tractable dictionary is an electronically stored text with coded linguistic 
features that can be consulted for each lexical item in a piece of natural language to assist 
in the process of decoding syntactic structure and meaning. A central feature of the survey 
is the creation of computerised electronic dictionaries for each language for use as a
primary tool in language analysis. 


These dictionaries would be a primary tool for automating the process of assigning
meanings to word forms in running text, based on a richer set of linguistic coding than is
available in the Longman Dictionary of Contemporary English (1978) (LDOCE),
which is currently in widespread use. Such dictionaries would improve and extend the
codes available in LDOCE in order to overcome some weaknesses of overall structure,
and of implementation. On the question of structure, to quote from Pustejovsky, Chapter
2 in Guo, in relation to dictionaries generally:


Current lexicons for natural language systems and linguists alike reflect, through
their organisation, the traditional view of word senses. In particular, they assume
that the space of possible uses of a word is exhaustively carved out by an
enumerated set of senses for that word. This is inadequate for several reasons.
First, it cannot account for the ‘creative’ use of words, that is, the way in which
words acquire new senses from novel contexts. Second, it is unable to account for
the ‘permeability’ of word senses, namely, that many words don’t have distinct
senses at all, but are complex overlapping meanings. Finally, it says nothing about
the predictability of syntactic forms for a word. Thus, if a word participates in
distinct syntactic relations, why should it be assigned a distinct word sense as well,
as is usually the case?


In addition, there is a wide range of areas where the architecture of the coding in
LDOCE has yielded insights which will enable a much better coding structure to be 
adopted. 


On the point of implementation, see Chapter 17 of Guo: 


The semantic codes are often too general, and the subject codes are too often
missing. Misapplied codes in LDOCE (1978) have never been corrected, and the later
edition omits much of the useful data.


The hierarchy of genus words in definitions, relating the specific to the more general
through successive definitions, was not a part of the LDOCE coding, and attempts to
automate the procedure have already proved to be error-prone. Such coding, as with the
meaning-assignment referred to above in relation to concordanced texts, is probably best
done editorially with machine assistance.


An example of the marking of such a hierarchy, using definitions from LDOCE, is:-


dachshund   a small *dog* with short legs etc.
dog         a common four-legged *animal*, esp. etc.
animal      a living *creature*, not a plant etc.
creature    a living *being* of any kind etc.
being       a living *thing*, esp. etc.
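
Once genus words have been marked editorially in this way, following the hierarchy by machine is straightforward. The sketch below assumes the asterisk notation shown above and a dictionary held as a simple mapping from headword to definition; the function name `genus_chain` is illustrative.

```python
import re

# Definitions with the genus word marked by asterisks, as in the example above.
definitions = {
    "dachshund": "a small *dog* with short legs",
    "dog": "a common four-legged *animal*",
    "animal": "a living *creature*, not a plant",
    "creature": "a living *being* of any kind",
    "being": "a living *thing*",
}

def genus_chain(word, defs):
    """Follow marked genus words from `word` up to the most general term."""
    chain = [word]
    seen = {word}
    while chain[-1] in defs:
        match = re.search(r"\*(\w+)\*", defs[chain[-1]])
        # Stop on an unmarked definition or on a circular definition.
        if match is None or match.group(1) in seen:
            break
        chain.append(match.group(1))
        seen.add(match.group(1))
    return chain

print(" > ".join(genus_chain("dachshund", definitions)))
```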




4.4 Semantic Relations 


Various kinds of semantic relationship can be documented systematically, other than the 
one of hierarchy implicit in the genus words of definitions. If automatic look-up 
dictionaries are created for several languages, then a relational database will permit the 
linking of lexical items having the same or a broadly similar meaning. In addition, it will 
be useful to create various kinds of cross-link between connected items on the grounds of 
synonymy, antonymy and other relations. This will be an ongoing process, and trial and 
error should reveal the most promising approaches. Other sorts of link, as with 
etymological cognates, could also be handled in a similar way. 


The sort of semantic relations that will need categorization, apart from synonymy, will 
include the main types identified by semanticists (e.g. complementarity, converseness,
etc.) and synonyms with different register and/or part of speech (e.g. tooth/dental in 
English). 
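
A relational treatment of such links might look like the following sketch, which uses an in-memory SQLite database from the Python standard library. The table layout, the column names and the relation labels are assumptions made for this example only, not a description of the CLS database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (id INTEGER PRIMARY KEY, language TEXT, word TEXT, pos TEXT);
CREATE TABLE relation (
    source INTEGER REFERENCES item(id),
    target INTEGER REFERENCES item(id),
    kind   TEXT
);
""")
conn.executemany("INSERT INTO item VALUES (?, ?, ?, ?)", [
    (1, "en", "tooth", "noun"),
    (2, "en", "dental", "adjective"),
    (3, "fr", "dent", "noun"),
    (4, "en", "hot", "adjective"),
    (5, "en", "cold", "adjective"),
])
conn.executemany("INSERT INTO relation VALUES (?, ?, ?)", [
    (1, 2, "synonym_other_pos"),  # tooth / dental
    (1, 3, "translation"),        # cross-language link
    (4, 5, "antonym"),
])

def related(word, kind):
    """Return (language, word) pairs linked to `word` by relation `kind`."""
    return conn.execute(
        "SELECT t.language, t.word FROM relation r "
        "JOIN item s ON s.id = r.source "
        "JOIN item t ON t.id = r.target "
        "WHERE s.word = ? AND r.kind = ? ORDER BY t.id",
        (word, kind),
    ).fetchall()

print(related("tooth", "synonym_other_pos"))
print(related("tooth", "translation"))
```

Keeping the relation type as data rather than as separate tables makes it easy to add new kinds of link, such as etymological cognates, as trial and error suggests them.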


In relating lexical items to each other between languages, consideration will need to be 
given to a non-language-specific classification system to which the various languages can 
be related (see 4.10 below). 


4.5 Corpora categorised to demonstrate frequency of meaning 


I am using the term corpora in a broader sense than computational. Linguists and 
lexicographers have developed manual methods of data gathering which remain valid and 
avoid some of the pitfalls of computer-based activities. As mentioned earlier, CLS should 
be eclectic and would thus gather all sorts of data, including:- 


Manually collected instances of usage of native speaker text and speech. 


Collection of non-native-speaker text for analysis in order to produce information 
on language interference. 


Concordanced materials from a whole range of linguistic genres, for example: 
Across language variety, e.g. British English, Argentinian Spanish 
Across types of text, e.g. novel, textbook 
Across timescales, e.g. pre-1945, post-1945 


Across subjects: this involves the collection of reading materials 
recommended by subject consultants with a sensible 
overview of subjects exhibiting cultural diversity and/or 
rapid change (e.g. scientific, technical, legal, political) 


The language of film, TV, pop music, etc.


The advice of those involved in making selections of this kind from the academic or
professional lexicographic world would be taken. The problems of copyright and
permissions would also require systematic handling, and again advice from those who have
had to cope with this is being taken.




4.6 Cultural data 


We will wish to collect a large body of material exemplifying cultural assumptions and
allusions built into the various languages. These will provide ‘real-world’ data alongside 
the linguistic data. 


One valuable source of cultural preoccupation of languages is existing monolingual and 
bilingual dictionaries, where the choice of words for the headword list (and for the 
defining vocabulary) can provide guidance as to cultural preoccupations of a language 
group. An extreme example of this can be found by looking at languages remote from 
Indo-European (such as Tibetan, e.g. entries for pillar; cow-flesh; leather; dung; large,
great, little/drum; etc.). 


Similarly, as has been pointed out by Morton Benson with reference to Slavic languages,
textual examples in dictionaries can provide insights into cultural / political differences,
e.g. the Russian dictionary example “the stores were filled with merchandise” would be
unlikely to occur in dictionaries from European countries where this is taken for granted!


4.7 Language Variety


An important aspect of cross-language comparisons of CLS will be concerned with
differences of language variety between languages for the same semantic areas (e.g.
differences of formality level).


4.8 Words in Groups 


Words in Groups is adopted as a general term for lexical items consisting of more than
one graphic word, and embraces different kinds of language including idioms, phrases,
collocations, proverbs, quotations and allusions. The adequate documentation of these
seems a neglected area. Concordance software must be developed (see 4.1 above) to
reduce the manual task of finding words in groups. An essential part of this task will be
included in the statistical analysis of how one word collocates with others (see 4.9 below).


4.9 Collocation 


Parsing software often looks at syntactic and/or semantic patterns to determine word 
meanings and grammatical behaviour in a stream of text. A promising approach to enable 
decoding to be performed satisfactorily is to include as one of the main tools a precise 
statistical backup giving the likelihood of a particular meaning of a word form associating 
with adjacent lexical items. Such software will be developed as part of the concordancing 
package. 
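
The kind of statistical backup meant here can be sketched as a conditional relative frequency: given sense-coded concordance lines, how often does each meaning of a word form occur next to a particular neighbouring word? The observations and sense labels below are invented for illustration, and a real system would need far more data and some form of smoothing.

```python
from collections import Counter

# (word form, hypothetical sense label, adjacent word) from coded concordance lines.
observations = [
    ("strikes", "hit", "snake"),
    ("strikes", "hit", "snake"),
    ("strikes", "chime", "hour"),
    ("strikes", "stoppage", "miners"),
    ("strikes", "stoppage", "about"),
    ("strikes", "hit", "about"),
]

def sense_likelihoods(data, form, neighbour):
    """Estimate the likelihood of each sense of `form` next to `neighbour`
    as a simple relative frequency over the coded observations."""
    counts = Counter(
        sense for f, sense, n in data if f == form and n == neighbour
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {sense: count / total for sense, count in counts.items()}

print(sense_likelihoods(observations, "strikes", "snake"))
print(sense_likelihoods(observations, "strikes", "about"))
```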


4.10 Towards a Universal Linguistic Coding


There is interest among lexicographers and computational linguists to agree on a common 
coding system to classify linguistic and lexicological features. 


The ‘bare bones’ of this coding is in development, and the description given here is an 
outline of what we believe is going to be useful:-


1) Coding is hierarchical, using pop-down menus to select the appropriate level, and
also allowing for the possibility of multiple codes of any one type to be attached.
For example, for subject area there are about 12 top-level codes identified by a
single letter, beneath which are two-letter codes, beneath that three-letter codes,
and so on as required, e.g. S (sports and games), NT (net games), TEN
(tennis).


2) There is no attempt to produce a single homogeneous coding system for all
categories of meaning. Parts of speech have different requirements, and within part
of speech there will be broad categories which have a different coding structure
applied to them, e.g. concrete nouns are to be treated differently from
non-concrete.


3) The types of coding to be developed include:-

a) syntactic and grammatical (e.g. verb complementation)

b) semantic, with subcategories:
      subject
      thesaurus (synonym etc. sets)
      selectional restrictions (e.g. subject / object)
      semantic relations
      locality (continent, country, compass points)
      climate (e.g. arctic, tropical)

c) morphological (e.g. inflections, combining potential, morphemes, part of speech
components of compounds)

d) orthographic (e.g. number of syllables, consonant clusters)

e) phonological / stress

f) etymological (e.g. language of origin, cognates, language described)

g) style:
      national restriction
      regional
      level (e.g. formality)
      attitude (e.g. derogatory)

h) status (e.g. neologism, proper name, cross-reference)

i) collocation

j) frequency

There are numerous other possibilities.


4) The heterogeneous nature of codes may mean that the same codes are employed
in different ways in different categories. For example, concrete nouns will have
hierarchical semantic codes (such as animate) together with added flags for
features such as size, group of, etc. Abstract nouns will have Roget-like primary
classification plus flags for such features as intensity. A first cut for the categories
for abstract nouns might include:-

emotions / feelings
acts and activities
events and processes
quality / property
location
belief / ideology / system
state / condition
relationship
classification
etc.




Concrete nouns may carry a set of types such as:-

function / purpose
form / shape
size
colour
texture
taste
method
owned by
location
material / substance
etc.
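
The hierarchical subject codes described in point 1) can be sketched as a simple parent table. Only the three codes S, NT and TEN come from the text; the table layout, the function names and the sample entry are assumptions for illustration.

```python
# Each code points to its parent; top-level codes have no parent.
PARENT = {"S": None, "NT": "S", "TEN": "NT"}
LABEL = {"S": "sports and games", "NT": "net games", "TEN": "tennis"}

def path(code):
    """Return the codes from the top level down to `code`."""
    result = []
    while code is not None:
        result.append(code)
        code = PARENT[code]
    return list(reversed(result))

def is_within(code, ancestor):
    """True if `code` is `ancestor` itself or lies beneath it."""
    return ancestor in path(code)

# A hypothetical entry carrying multiple codes of one type.
entry = {"headword": "volley", "subject": ["TEN"]}

print(" > ".join(LABEL[c] for c in path("TEN")))
print(any(is_within(c, "S") for c in entry["subject"]))
```

Storing only the most specific code on an entry and deriving the broader levels from the table keeps the coding compact while still allowing retrieval at any level of the hierarchy.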


The coding of text corpora will utilise and develop commonly agreed systems of coding
based on SGML (Standard Generalized Markup Language).


4.11 Marking of Morpheme Boundaries 


Some system of marking morphemic boundaries in corpus data and in the dictionaries,
e.g. in compounds, and to show affixes and inflectional endings, needs to be developed.
This might be similar to the syllabic dividers used in LDOCE and various dictionaries.
This will permit morphological decomposition of semantically open compounds, for
example.
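
As a minimal sketch of how such marking could be used, the example below assumes a plus sign as the boundary marker. The marker itself and the function names are hypothetical, since the notation has yet to be developed.

```python
BOUNDARY = "+"  # hypothetical divider; the actual notation is still to be decided

def split_morphemes(marked):
    """Split a boundary-marked form into its morphemes."""
    return marked.split(BOUNDARY)

def surface_form(marked):
    """Recover the ordinary spelling by deleting the boundary markers."""
    return marked.replace(BOUNDARY, "")

marked = "black+bird+s"
print(split_morphemes(marked))
print(surface_form(marked))
```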


4.12 Classified Interlanguage Interference Data 


CLS will build on materials gathered from existing projects, where access is granted (e.g.
the project funded by the Swiss Council for Scientific Research for French, German and
Italian speaker errors in English). Also, existing bilingual proofchecking software is a
source of material, as are published dictionaries of false friends and studies of structural
interference problems.


4.13 Neologisms and proper names 


Taking in text data from newspapers and other sources raises the considerable problem of 
coping with word forms not previously encountered. This will include proper names, new 
word forms and typographic errors, some of which will need to be automatically added to 
on-line dictionaries. Collaboration will be undertaken with those involved in developing 
software to analyse these and construct, where appropriate, lexicon entries for them. This
is done for example by analysing appositional phrases such as “ICI, the largest chemical 
company in the UK, etc.” 
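
The appositional approach can be illustrated with a simple pattern match. This is a rough sketch only: the regular expression covers just the pattern "Name, the/a/an ..., " and would need considerable refinement for real newspaper text; the function name and the `known` word list are assumptions.

```python
import re

# A capitalised name, a comma, then a descriptive phrase closed by a comma.
APPOSITION = re.compile(
    r"\b([A-Z][A-Za-z&.]*(?: [A-Z][A-Za-z&.]*)*), ((?:the|a|an) [^,]+),"
)

def propose_entries(text, known):
    """Propose lexicon entries for names not already in `known`
    (a set of lower-cased headwords), using the apposition as a gloss."""
    entries = {}
    for name, description in APPOSITION.findall(text):
        if name.lower() not in known:
            entries[name] = description
    return entries

text = "ICI, the largest chemical company in the UK, announced new results."
print(propose_entries(text, known=set()))
```

Proposals of this kind would still need editorial checking before being added to an on-line dictionary.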




5. PRODUCTS AND BENEFITS OF THE SURVEY 

Possible products of the survey:- 

analysed corpora in the various languages, parallel and aligned 
machine-tractable dictionary databases 

software tools, for a whole range of applications 


electronic products for the whole range of current and future systems, including 
information sources (including dictionaries and encyclopedias) on CD-ROM, floppy disc 
and tape, CDI, hand-held solid state, hand-held disc-based, etc. 


It is expected that the material and results of the survey, both qualitative and quantitative, 
will provide benefits:- 


a) Providing an on-line database of language information for areas of computational 
product development including automatic parsing and semantic analysis. 


b) Being a resource for educational and publishing purposes, for electronic and 
non-electronic delivery. 


c) Access to a large and growing body of data in the form of multilingual linguistic 
corpora, suitably coded and concordanced for retrieval. 


d) Access to growing lexicons in the form of Lexical Data Bases (LDBs) for each
language, suitable, in conjunction with software, for use in research and product
development.


e) Access to software in the form of relational databases and analytic tools for 
analysis of Lexical Knowledge Bases (LKBs). 


f) Links between prestigious academic institutions throughout the world, the
opportunity to further scholarly research in the important areas of language and
communications, and the opportunity to be associated publicly with these activities,
along with others.


g) Opportunities for product development in a broad range of areas, many still to be
identified.


REPLY SHEET


(Please use the back of this sheet for full comments on the ... and/or write on
photocopies of specific pages and send these pages back with it. Your reactions and
thoughts will be sincerely appreciated.)


Dear


I am pleased to send you details of our language survey. If you would like us to send you
the latest information about the survey regularly, please return this sheet to us. Please
also feel free to use the space below to ask us to send information to any of your
colleagues who may be interested in our activities (please photocopy additional sheets
if necessary). If you prefer, you may give COPIES OF THIS DOCUMENT to colleagues
or request copies from us.


TITLE (Mr, Mrs, Ms, Dr, Prof, etc.) ........
NAME ........
ADDRESS ........

TELEPHONE NUMBER ........
FAX NUMBER ........
ELECTRONIC MAIL ........


TITLE (Mr, Mrs, Ms, Dr, Prof, etc.) ........
NAME ........
ADDRESS ........

TELEPHONE NUMBER ........


(please specify language)


PLEASE TICK THE BOX IF APPLICABLE


PLEASE RETURN TO:

Paul Procter
Senior Editor, International Dictionaries
Cambridge University Press
Edinburgh Building
Shaftesbury Road
Cambridge CB2 2RU

Daytime telephone: 0223 325880
Evening telephone: 0920 465890
Facsimile: 0223 315052
Electronic mail: psp10@uk.ac.cam.phx (within UK)
                 psp10@phoenix.cambridge.ac.uk (outside UK)
Telex: 817256




Applications of the research include:- 


EDUCATION 


Education (user needs) 
Language interference studies 
Specialised lexicons 
Language acquisition 
Translation aids and systems 


NATURAL LANGUAGE PROCESSING 

Parsing (semantic and syntactic analysis)
Text understanding
Text generation
Discourse understanding
Machine-assisted translation leading to machine translation
Speech recognition and synthesis


PUBLISHING 
Book and non-book media (see 5 above) 


6. POTENTIAL SPONSORS AND OTHER CONTACTS


Companies, institutions, publishers and industrial partners being approached (most known
to be involved, at some level or other, in language research)


AT 
Apple 
BBC Monitoring Services 
Bell 

British Telecom 

Canon 

Compaq 

Data General 

DEC 




Dell
DTI (British Department of Trade and Industry)
Epson 
European Commission: Acquilex, Eurotra 7, Multilex, Genilex (Eureka) 
General Motors 
Haskyns 
Hewlett Packard 
a 
(Fujitsu) 
Japan: EDR


El 

Mead Data Central
Microsoft 
Mitsubishi (Apricot) 
Monotype 
Novell 
Pan American Health Organisation
Panasonic 
Philips 
Praka 
Rank-Xerox 
Revelation Technologies 

eider 
Sharp 
Sane 

UCLES (University of Cambridge Local Examinations Syndicate)
United States: DARPA, NSF, TEI
Wang 
Xerox 




7. CLS PANEL 


A CLS panel will be established in the near future, under the chairmanship of John
Lyons. This will have a small number of officers (president, chairman, secretary) and a 
group of invited members. The panel will meet only rarely, and meetings will be 
organised as far as possible to coincide with visits to Britain by overseas members. 


Choice of panel members (taking account, where possible, of gender balance) 


Principles of selection 


Representatives from main language groups:- scholars working in main subdisciplines of 
each language, e.g. grammar, semantics. 


Representatives from U.S. and Australasia, and later from other areas where English is 
the L1 or an important L2. 


Representatives of ‘communications industry’ including journalism, film, fiction writing, 
politics. 


Experts in subject areas where there is:- wide cultural diversity (e.g. religion and law); 
involvement with cultural sensitivities (e.g. food); rapid changes of vocabulary (e.g. 
ecology). 


